Can Clustered File Systems Support Data Intensive Applications?
Abstract
This work-in-progress (WIP) addresses the question: can cluster file systems match specialized file systems such as Google's GFS for data-intensive applications? With the explosive growth of information and of applications exploiting that information, large-scale data processing has emerged as an important challenge. Example applications include web search, indexing and mining, discovering biological functions from genomic sequences, detecting astronomical phenomena in telescope imagery, and building brain-scale networks for cognitive systems.

These data-intensive applications demand a scalable yet cost-effective storage layer. The storage layer must scale to thousands of nodes so that highly parallel applications can process petabytes of data in hours rather than days. At the same time, the infrastructure needs to be built from commodity components to minimize cost while tolerating the failures that are typical of such components. Given the large volumes of data being processed, another key requirement for this storage layer is to enable shipping compute to the data rather than the other way around.

Recently, enterprises faced with these critical needs have proposed specialized file systems, built with the unique requirements of this layer in mind. For example, Google developed GFS, a file system optimized for large sequential and small random reads on a small number of large files residing on a commodity cluster. Companies such as Yahoo and Kosmix followed this trend by emulating the GFS architecture in Hadoop DFS and KFS, respectively. For the scope of this work, we choose the open-source Hadoop DFS (HDFS) as a representative specialized file system.

This work argues that cluster file systems can also rise to the challenges posed by these data-intensive applications. Moreover, there are inherent advantages to using cluster file systems in this paradigm: (1) these file systems provide well-known traditional file APIs to this new class of applications; (2) having been in production for years, they come with a rich set of management tools, such as automated backup and disaster recovery; (3) they can simultaneously support legacy applications that rely on traditional file APIs, obviating the need to maintain different storage layers for different applications; and (4) an interesting trend further motivates this study: enterprises are increasingly incorporating data analytics into their workflows, resulting in a mix of legacy applications and the new class of data-intensive applications accessing a common storage layer. There is ample evidence that existing cluster file …
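To make the API contrast behind advantages (1) and (3) concrete, here is a minimal sketch, not taken from the paper: the same sequential read is written once against the traditional POSIX-style interface that a mounted cluster file system exposes through ordinary java.io calls, and once against the Hadoop FileSystem client API that applications must use to reach HDFS. The mount point /mnt/cfs, the file paths, the class name, and the process() helper are hypothetical placeholders.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadContrast {

        // Legacy path: a cluster file system mounted at /mnt/cfs (hypothetical)
        // is readable with ordinary java.io calls, so legacy code runs unchanged.
        static void readViaPosixMount() throws IOException {
            try (BufferedReader reader = new BufferedReader(
                    new FileReader("/mnt/cfs/data/part-00000"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    process(line);
                }
            }
        }

        // HDFS path: the same data on HDFS must be read through the
        // Hadoop-specific FileSystem API instead of java.io.
        static void readViaHdfs() throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            try (FSDataInputStream in = fs.open(new Path("/data/part-00000"));
                 BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    process(line);
                }
            }
        }

        static void process(String line) { /* application logic placeholder */ }
    }

A cluster file system serves both call sites through the first, standard path, which is why it can host legacy and data-intensive workloads on one storage layer; with a specialized file system such as HDFS, legacy code has to be ported to the second path.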
Similar resources
Checkpointing Orchestration for Performance Improvement
Checkpointing is a widely used mechanism for supporting fault tolerance in high-performance computing (HPC), but it is notorious for its expensive disk accesses. Parallel file systems such as Lustre, GPFS, and PVFS are widely deployed on supercomputers to provide fast I/O bandwidth for general data-intensive applications. However, the unique characteristics of checkpointing make it impossible to benefit from the ...
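As a rough illustration of why checkpointing is dominated by disk I/O, the sketch below (not from the cited work; the mount point and class name are hypothetical) writes the entire in-memory application state to a parallel file system and forces it to stable storage once per epoch, so every checkpoint pays for the full synchronous write path.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class Checkpointer {
        // Hypothetical mount point of a parallel file system (e.g. Lustre or GPFS).
        private final Path dir = Paths.get("/mnt/pfs/ckpt");
        private int epoch = 0;

        // Persist the whole application state; force(true) blocks until the
        // data reaches stable storage, which is where the disk-access cost
        // described above comes from.
        void checkpoint(ByteBuffer state) throws IOException {
            Path file = dir.resolve("ckpt-" + (epoch++) + ".bin");
            try (FileChannel channel = FileChannel.open(file,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                while (state.hasRemaining()) {
                    channel.write(state);
                }
                channel.force(true);
            }
        }
    }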
High-Performance Storage Support for Scientific Big Data Applications on the Cloud
This work studies the storage subsystem for scientific big data applications running on the cloud. Although cloud computing has become one of the most popular paradigms for executing data-intensive applications, the storage subsystem has not been optimized for scientific applications. In particular, many scientific applications were originally developed assuming a tightly-coupled cluster ...
Towards a Next Generation Distributed Middleware System for Many-Task Computing
Distributed computing systems have evolved over decades to support various types of scientific applications, and the overall computing paradigms have been categorized into HTC (High-Throughput Computing), which supports bags of tasks that are usually long-running, and HPC (High-Performance Computing), which processes tightly-coupled, communication-intensive tasks on top of dedicated clusters of workstations or...
Panache: A Parallel File System Cache for Global File Access
Cloud computing promises large-scale and seamless access to vast quantities of data across the globe. Applications will demand the reliability, consistency, and performance of a traditional cluster file system regardless of the physical distance between data centers. Panache is a scalable, high-performance, clustered file system cache for parallel data-intensive applications that require wide a...
Data-intensive file systems for Internet services: A rose by any other
Data-intensive distributed file systems are emerging as a key component of large-scale Internet services and cloud computing platforms. They are designed from the ground up and are tuned for specific application workloads. Leading examples, such as the Google File System, the Hadoop Distributed File System (HDFS), and Amazon S3, are defining this new purpose-built paradigm. It is tempting to classif...
Publication year: 2009